Exploratory and Explanatory Data Visualization for Education

SDP SILA February 2013

Jared Knowles
Research Analyst, Wisconsin DPI

The Problem of Data Visualization

  • Data use is increasing rapidly within the education space
  • Policymakers are under increasing pressure to use data to inform decisions, justify funding, and guide practice
  • But, policymakers are often not statisticians, researchers, or quants
  • Data visualization is a way to bridge this gap
  • Proper data visualization will bring the data to the audience in a way they can understand quickly and use to inform decisions

Follow Along

What is dataviz?

Dataviz is...

  • An exploratory tool for understanding datasets
  • A communication tool for framing decisions and depicting problems
  • A way to showcase
  • A better way to present results of analyses

Dataviz is not...

  • Easy
  • A replacement for analysis
  • Infographics
  • Easy!

Data visualization is a tool for communicating a specific feature of a dataset in an approachable and efficient manner.

If a picture is worth a thousand words, a good data visualization must always be better than a table.

Student Growth

Objectives

  1. Review data visualization principles and chart types
  2. Techniques for plotting "big data"
  3. Look at applications in education data from an SEA
  4. Your graphs are ugly!
  5. Best practices and advice
  6. What tools to use
  7. Activity!

Determining the Audience

  • Making good graphics starts with considering your audience carefully
  • Find the level of complexity you think your audience can understand, and push it
  • Think about the point you want to get across and make sure it is crystal clear
  • Looking cool is not enough; focus on conveying meaning
  • Remember, the audience rarely shares the understanding of the data that you built from the dozens of earlier graphics you made
  • Do the work for them!

Exploratory vs. Explanatory Graphics

Exploratory graphics

  • Display relationships and patterns of interest to analysts
  • Are incredibly disposable
  • Emphasize quick iteration and inspection
  • Are not polished

Explanatory graphics

  • Are crafted around a single clear message
  • Carefully rendered in high quality formats
  • Well labeled, self-contained, and self-explanatory
  • Intended to be accessible to non-experts

An Exploratory Plot

How can we improve this simple scatterplot?

plot of chunk plot

Principles for Making it Explanatory

  • Elements of a chart
  • Chart Types and Data Types
  • Dimensionality
  • Scale
  • Complexity
  • Technical details
  • Beyond charts

plot of chunk plot1

Chart Elements

There are a few things that all charts need. There are sometimes strong cases to deviate from these, but they are good rules of thumb.

  • Axis labels and a title
    • These make the chart self-explanatory
  • A legend
    • What is the unit in the graphic?
  • A scale
    • How are units mapped to the visual space?
  • Annotations
    • Author and data source (depending on distribution)

Dimensions

  • Charts and data are made up of dimensions (e.g. a bar chart is x and y)
  • Additional dimensions can be represented by additional aesthetics or chart elements (e.g. color, size, shape, etc.)
  • Dimensions can also be shown by multiple plots (e.g. a filmstrip)
  • Smart use of dimensions allows us to increase the information density of our charts

plot of chunk unnamed-chunk-1

How you turn dimensions in the data into visual cues for your audience is everything.

Reviewing Chart Types

Stacked Bar

Box and Whisker

Bullet Chart

Calendar

Lines

Parallel Coordinates

Parallel Sets

Tree Map

Word Cloud

Data Types

  • Any given dimension may be measured at a different level of measurement [defined by Stanley Smith Stevens in the 1940s and 50s]
    • Nominal: unordered categories of data (e.g. race)
    • Ordinal: ordered categories of data, relative size and degree of difference between categories is unknown (e.g. Likert scales, proficiency levels perhaps)
    • Interval: ordered categories of data, fixed width, like discrete temperature scales (e.g. grade in school)
    • Continuous (ratio): a measurement scale in a continuous space with a meaningful zero - physical measurements (e.g. scale scores)

Mapping Levels of Measure to Visual Cues

Aesthetics for Mapping

How do we map levels of measurement onto visual features of charts?

Aesthetic   Discrete                      Continuous
Color       Disparate colors              Sequential or divergent colors
Size        Unique size for each value    Mapping to radius of value
Shape       A shape for each value        Does not make sense

Ordered vs. Unordered

Aesthetic   Ordered                           Unordered
Color       Sequential or divergent colors    Rainbow
Size        Increasing or decreasing radius   Does not make sense
Shape       Does not make sense               A shape for each value
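
The two tables above can be read as a lookup from a dimension's level of measurement to a sensible aesthetic. The sketch below is one possible encoding of that lookup in Python. The function name `suggest_aesthetics` and the returned wording are illustrative assumptions, not part of any plotting library, and where the two tables differ the sketch follows the ordered vs. unordered table.

```python
# Hypothetical helper: encodes the ordered vs. unordered table as a lookup.
# Level names follow the Data Types slide; nothing here is a library API.

# True means the level of measurement carries an order.
IS_ORDERED = {"nominal": False, "ordinal": True, "interval": True, "ratio": True}


def suggest_aesthetics(level):
    """Return a dict of aesthetic -> advice for one level of measurement."""
    level = level.lower()
    if level not in IS_ORDERED:
        raise ValueError(f"unknown level of measurement: {level!r}")

    ordered = IS_ORDERED[level]
    return {
        # Ordered data gets a sequential or divergent ramp; unordered data
        # gets clearly distinct colors.
        "color": "sequential or divergent colors" if ordered else "disparate colors",
        # Size implies more or less, so it only suits ordered data.
        "size": "increasing or decreasing radius" if ordered else "does not make sense",
        # Shapes have no natural order, so they only suit unordered data.
        "shape": "does not make sense" if ordered else "a shape for each value",
    }
```

For example, `suggest_aesthetics("nominal")` points toward distinct colors and shapes for something like race, while `suggest_aesthetics("ratio")` points toward a color ramp or point size for scale scores.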

Example

plot of chunk plot1.1 (six plots)

Think like a map. Data density and easy interpretability.

Maps

Some tips

  • Focus on the content
  • Consider your audience
  • Use best practices
  • Understand the limitations
  • Experiment and iterate!

Examples

Charting Categorical Data

plot of chunk unnamed-chunk-2

Charting Ordinal Data

plot of chunk unnamed-chunk-3

Charting Interval/Continuous Data

plot of chunk unnamed-chunk-4

Complexity

How do we display a ton of data--tens or hundreds of thousands of observations--with combinations of data types across multiple dimensions?

  1. Summarize the data
    • Display summary statistics visually depicting the central tendency and spread of data
  2. Plot the raw data
    • Annotate wisely to display the main message
  3. Model the data
    • Use a statistical model to summarize features of the data

Let's look at some examples of this.

Summarizing Data

  • The simplest and most easily understood summaries are measures of central tendency
  • It is also important to look at the spread of the data
  • If time is of interest, we are interested in trends
  • If space is of interest, we are interested in maps or spatial distributions
  • Think about context and reference
  • Let's look at an example summarizing student data to schools!
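
As a minimal sketch of what "summarizing student data to schools" can look like, the Python below collapses student rows into one row per school with a count, a mean, a median, and a standard deviation. The records, school names, and the function name `summarize_by_school` are made up for illustration; they are not the data behind the slides.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Made-up student records: (school, reading scale score). Illustration only.
students = [
    ("School A", 420), ("School A", 455), ("School A", 610),
    ("School B", 480), ("School B", 490), ("School B", 500),
]


def summarize_by_school(records):
    """Collapse student rows to one summary row per school."""
    scores = defaultdict(list)
    for school, score in records:
        scores[school].append(score)

    summary = {}
    for school, values in scores.items():
        summary[school] = {
            "n": len(values),
            "mean": mean(values),
            "median": median(values),
            # Sample standard deviation needs at least two students.
            "sd": stdev(values) if len(values) > 1 else 0.0,
        }
    return summary


school_summary = summarize_by_school(students)
```

In this toy data School A has the higher mean but the lower median and a much larger spread, which is the kind of difference a plot of means alone would hide.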

Plotting Means

Here is a simple plot of mean school reading scores:

plot of chunk plotmeans

But, what's wrong with this plot?

Mistakes

  • No sense of scale
  • Means can be skewed
  • Simple means are not meaningful
  • With assessment scores we need to know grade distribution
  • Let's try to improve this

plot of chunk plotmeanssmall

Adding a Dimension

plot of chunk meanplot2

Even More Dimensions

plot of chunk meanplot3

Annotation

We still aren't sure what the mean scale score means. Let's see a couple more additions that can make this useful.

plot of chunk meanplot4

Order

This still requires too much work for the viewer - if we order the plots we get a caterpillar plot that can be more useful!

plot of chunk meanplot5
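
The ordering step behind a caterpillar plot can be sketched in a few lines of Python. The school summaries below are hypothetical, and the approximate 95% interval around each mean is an assumption about what the plot shows; the slide's own figure may use a different measure of uncertainty.

```python
from math import sqrt

# Hypothetical school summaries: (school, mean score, sd, number of students).
schools = [
    ("School A", 495.0, 100.0, 25),
    ("School B", 470.0, 80.0, 100),
    ("School C", 510.0, 90.0, 36),
]


def caterpillar_rows(summaries, z=1.96):
    """Return (school, low, mean, high) rows sorted by mean, lowest first."""
    rows = []
    for school, avg, sd, n in summaries:
        half_width = z * sd / sqrt(n)  # approximate 95% interval for the mean
        rows.append((school, avg - half_width, avg, avg + half_width))
    # Sorting by the mean is what makes the plot readable at a glance.
    return sorted(rows, key=lambda row: row[2])


ordered = caterpillar_rows(schools)
```

Each row can then be drawn as a point with an error bar, in the sorted order, so the viewer no longer has to rank the schools by eye.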

Raw Data

Sometimes, we can get away with showing the raw data, that is, all data points. We may want to do this for a few reasons:

  • the "wow" effect,
  • because it is easier,
  • or because it looks better aesthetically.

How could it be done?

600,000 Observations Too Many

plot of chunk rawdata1

Spread the Data Out

  • Without reducing the number of data points, we need to do three things to be successful:
  1. Spread the data out
    • These points overlap each other and make a mess
  2. Reduce the ink
    • Each point has too much "weight"
  3. Add Reference Points
    • 600,000 observations in one panel is not meaningful
  • Edward Tufte and others recommend small multiples, a technique of repeating a plot across groups to compare relationships in multiple dimensions
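
A rough sketch of the data preparation behind small multiples, in Python: split the rows into one panel per group, and compute a single set of axis limits that every panel shares so the panels can be compared against a common reference. The records and the function name `small_multiples` are invented for illustration, and grouping by grade is an assumption.

```python
from collections import defaultdict

# Made-up records: (grade, math score, reading score). Illustration only.
records = [
    (3, 410, 400), (3, 450, 470), (4, 480, 465),
    (4, 520, 530), (5, 560, 540), (5, 500, 515),
]


def small_multiples(rows):
    """Split rows into one panel per grade and compute shared axis limits."""
    panels = defaultdict(list)
    for grade, x, y in rows:
        panels[grade].append((x, y))

    xs = [x for _, x, _ in rows]
    ys = [y for _, _, y in rows]
    # Every panel uses the same limits so panels can be compared directly.
    limits = {"x": (min(xs), max(xs)), "y": (min(ys), max(ys))}
    return dict(sorted(panels.items())), limits


panels, limits = small_multiples(records)
```

Each entry in `panels` would become one small plot. Drawing the points with partial transparency in each panel is one way to reduce the ink per point.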

What About This

plot of chunk rawdata2

Even Smaller Multiples

plot of chunk rawdata3

Binning Data

plot of chunk rawdata4
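
Binning replaces individual points with counts per cell, so the amount of ink depends on the number of cells rather than the number of observations. The Python below is a minimal sketch using square bins, which is an assumption; the figure on the slide may use a different bin shape. The score pairs are made up.

```python
from collections import Counter

# Made-up (math, reading) score pairs. Illustration only.
points = [(401, 455), (409, 452), (447, 498), (452, 503), (458, 509), (495, 541)]


def bin_2d(pairs, width=50):
    """Count points per square bin; keys are each bin's lower-left corner."""
    counts = Counter()
    for x, y in pairs:
        corner = (x // width * width, y // width * width)
        counts[corner] += 1
    return counts


bins = bin_2d(points)
```

Each bin can then be drawn as one tile whose color encodes its count, using a sequential color scale.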

Modeling the Data

All models are wrong. Some models are useful.

Smoothers

plot of chunk unnamed-chunk-5
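
To illustrate the idea of a smoother, here is a centered moving average in Python. This is a deliberately simple stand-in, not the method used for the figure; statistical packages typically offer loess or spline smoothers instead. The scores are made up.

```python
def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the edges."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        neighbors = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(neighbors) / len(neighbors))
    return smoothed


# Made-up mean scores, already sorted by the x variable. Illustration only.
scores = [400, 430, 410, 470, 450]
trend = moving_average(scores)
```

Plotting `trend` as a line over the raw points gives the viewer a reference for the overall pattern without hiding the data.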

Machine Learning

Regression Trees

Regression Results

Path Diagrams

Illustrating a Model

Effect Plot

plot of chunk effplot1

Simulation

  • When depicting a statistical model, simulation can be used to help show substantive impact of model features
  • Instead of suggesting a program has a 0.2 standard deviation effect, show the change in student test score that results
  • Especially important if you have interaction or non-linear effects
  • Gelman & Hill 2006 - Data Analysis Using Regression and Multilevel/Hierarchical Models is a good primer
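
A simplified sketch of the idea in Python: draw many plausible values of an effect and express each one in scale score points rather than standard deviations. Every number here is hypothetical (the 0.2 SD effect, its 0.05 standard error, and the 50-point score standard deviation), and drawing the effect from a single normal distribution is a simplification of simulating from a fitted model.

```python
import random
from statistics import mean, quantiles

# All numbers are hypothetical: an effect of 0.2 SD with a standard error of
# 0.05 SD, on a test whose scale scores have a standard deviation of 50 points.
EFFECT_SD, EFFECT_SE, SCORE_SD = 0.2, 0.05, 50.0


def simulate_score_gain(n_sims=10_000, seed=2013):
    """Draw plausible effects and express each as scale score points."""
    rng = random.Random(seed)
    return [rng.gauss(EFFECT_SD, EFFECT_SE) * SCORE_SD for _ in range(n_sims)]


gains = simulate_score_gain()
cuts = quantiles(gains, n=40)      # 39 cut points: 2.5%, 5%, ..., 97.5%
low, high = cuts[0], cuts[-1]      # approximate 95% simulation interval
typical = mean(gains)              # close to 0.2 * 50 = 10 points
```

A statement like "about 10 points, plausibly between `low` and `high`" is usually easier for a non-technical audience to act on than "0.2 standard deviations".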

Sim Example

Combining Features

We can combine these features.

  • Facets with smoother lines for references (small multiples + models)
  • Summary plots with raw data in the background
  • Reference lines and group comparisons

Animation Example

Why does this work?

  • Annotation
  • Labeling
  • Lots of data-ink
  • Reference points galore

What would make it better?

  • Interactivity - with this much information, allowing the user to go back and forth on their own would be much more useful
  • Size - most users don't have high resolution displays, but these can be helpful in making things more clear
  • Adjustments to the modeling process

Bad Graphs are Everywhere

When you can, avoid making graphs that look like the ones that follow.

Bad Scales

Unnecessary Dimensions

Too Cluttered

Style over Substance

Redundant

Some tips

  • Have a properly chosen format and design
  • Use words, numbers and drawing together
  • Reflect a balance, a proportion, relevant scale
  • Display an accessible complexity of details
  • Have a narrative quality, tell a story
  • Avoid content-free decoration (Tufte's proverbial chartjunk)
  • Draw in a professional manner with an eye on the technical details
  • Remember the map

Themes

They can communicate, confound, brand, and distract

plot of chunk plot2 (four plots)

Technical Details

Getting the technical details right is essential to preventing users from being distracted by the production and missing the story.

Graphics Files

Raster

  • Files like jpg, png, and gif
  • Fixed scale, aspect ratio, and size
  • Reasonable file size
  • Viewable in almost any image viewing and editing system, including any modern web browser, PowerPoint, etc.

Vector

  • Files like pdf and svg
  • Infinitely zoomable, adjustable on the fly
  • Larger file size
  • Viewable and usable in fewer systems. SVGs can be used in modern web browsers. PDFs can be included in other PDF reports.
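
To make the raster vs. vector distinction concrete, the Python sketch below writes a tiny scatterplot as SVG text. A vector file stores one drawing instruction per shape, which is why it stays sharp at any zoom level and also why its size grows with the number of points. The function name `scatter_svg` and the sample points are illustrative assumptions.

```python
def scatter_svg(points, size=200, radius=3):
    """Return a minimal SVG scatterplot as text; points are in 0..1 units."""
    circles = []
    for x, y in points:
        cx = x * size
        cy = (1 - y) * size  # SVG's y axis points down, so flip it
        circles.append(
            f'<circle cx="{cx:.1f}" cy="{cy:.1f}" r="{radius}" fill-opacity="0.5"/>'
        )
    body = "\n  ".join(circles)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {size} {size}">\n'
        f"  {body}\n</svg>"
    )


svg_text = scatter_svg([(0.1, 0.2), (0.5, 0.5), (0.9, 0.8)])
```

Saving `svg_text` to a file with an `.svg` extension should give an image a modern web browser can open. With hundreds of thousands of points, the same approach produces one `<circle>` element per observation, at which point a raster format is often the more practical choice.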

Technologies

The technology you choose to do visualizations is largely a question of personal productivity, but with some important caveats:

  • In the future, more and more content is going to be delivered in a paperless world, so pick a technology that can leverage web/tablet/phone interfaces
  • Different technologies are useful for different levels of finish and polish; Adobe Illustrator is great for publication quality graphics, R is a great tool for rapidly prototyping different visualizations
  • Choose a technology that best serves your consumer, not you as the producer. Charts are a service to the consumer, not to the creator.

Some Technologies

plot of chunk technologies

Programming vs. Illustrating

Keep in mind that, depending on the project, you may need to make data visualizations programmatically, or you may need a highly customized illustrated graphic.

Beyond Graphics

We have a number of other techniques we can use beyond simple charts.

  • Animations
  • Interactive demos
  • Summary tables
  • Videos
  • Web sites

Group Exercise

Visualize some education data. Imagine we have the following dimensions and want to present more of them on a plot like that on the right. Sketch out your result with your group.

  • Grade
  • Disability Status
  • Type of Disability
  • Language Proficiency
  • School
  • Math Score
  • Reading Curriculum
  • Math Curriculum

plot of chunk studentexample

Example

plot of chunk studentexample2

References

Where to Learn Online?

Review of Key Concepts

  • Dimensionality
  • Aesthetics and Mappings
  • Small multiples
  • Spreading the data out
  • Web vs. print
  • Adapt and iterate
  • Modeling the data
  • Themes and style
  • Techniques and software

Backmatter

print(sessionInfo(),locale=FALSE)
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] arm_1.6-10           lme4_1.0-5           Matrix_1.1-0        
 [4] datasynthR_0.1       reshape_0.8.4        effects_2.3-0       
 [7] colorspace_1.2-4     lattice_0.20-24      mgcv_1.7-27         
[10] nlme_3.1-113         plyr_1.8             vcd_1.3-1           
[13] ggthemes_1.5.1       eeptools_0.3         MASS_7.3-29         
[16] ggplot2_0.9.3.1      knitr_1.5.15         slidifyLibraries_0.1
[19] markdown_0.6.4       whisker_0.3-2        slidify_0.3.3       
[22] devtools_1.4.1      

loaded via a namespace (and not attached):
 [1] abind_1.4-0        car_2.0-19         coda_0.16-1       
 [4] data.table_1.8.10  dichromat_2.0-0    digest_0.6.4      
 [7] evaluate_0.5.1     foreign_0.8-57     formatR_0.10      
[10] gtable_0.1.2       httr_0.2           labeling_0.2      
[13] maptools_0.8-27    memisc_0.96-9      memoise_0.1       
[16] minqa_1.2.1        munsell_0.4.2      nnet_7.3-7        
[19] parallel_3.0.2     proto_0.3-10       RColorBrewer_1.0-5
[22] RCurl_1.95-4.1     reshape2_1.2.2     scales_0.2.3      
[25] sp_1.0-14          splines_3.0.2      stringr_0.6.2     
[28] tools_3.0.2        yaml_2.1.7        

For Fun

Ugly All HTML5 Graphic